This report investigates and analyses the Dallas policing data set for the year 2016. The analysis includes, but is not limited to, exploration of the data set and both numerical and graphical presentation of its key figures. The exploration also covers identifying and handling missing values and determining which features of the data set are important.
This report focuses on answering a set of research questions about the data. These research objectives lay the foundation, but the report covers considerably more than them alone.
Firstly, we read the data set and examine the number of features and observations. We also look at summary statistics, which lets us identify the missing values in each feature. Rows missing values in unimportant features can be discarded using the complete-case method, whereas important features cannot be discarded and are instead imputed with the mean/median or through multiple imputation. The code below reads the data set from the CSV file so that we can inspect the top observations for each feature.
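As a minimal illustration of the imputation option mentioned above (a toy vector, not the policing data), median imputation in base R looks like this:

```r
# Toy example: fill missing values with the median of the observed values.
x <- c(4, NA, 10, 6)
x[is.na(x)] <- median(x, na.rm = TRUE)  # median of 4, 10, 6 is 6
x  # 4 6 10 6
```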
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
setwd("C:/Users/Admin/Downloads")
policing <- read.csv('Prepped data.csv', header = TRUE)
Looking at the data set, we can see that the first row merely repeats the feature names, which we already have as variable names, so it can be removed.
policing <- policing[-1,]
Next, we examine the characteristics of the data set: its structure, dimensions, and names. Running the summary() and dim() functions shows that we have 2383 observations and 47 different features. Most features are of type 'character'; however, some contain numerical values, and we can separate numerical from non-numerical features using R's built-in functions.
dim(policing)
## [1] 2383 47
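A sketch of the built-in approach alluded to here, using sapply() with is.numeric() on a toy frame (the column names are hypothetical):

```r
# Split a data frame's column names by type; the same idea is applied
# to the policing data further below.
df <- data.frame(id = 1:3, years = c(5L, 12L, 8L), race = c("A", "B", "A"))
num_cols <- names(df)[sapply(df, is.numeric)]
chr_cols <- names(df)[sapply(df, is.character)]
num_cols  # "id" "years"
chr_cols  # "race"
```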
Before proceeding with the analysis, we should verify that the necessary libraries are installed and loaded for this project, to avoid errors caused by missing packages.
library(dplyr)
library(ggplot2)
library(plotly)
## Warning: package 'plotly' was built under R version 4.1.3
##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
library(tidyr)
library(ggpubr)
## Warning: package 'ggpubr' was built under R version 4.1.3
library(corrplot)
## Warning: package 'corrplot' was built under R version 4.1.3
## corrplot 0.92 loaded
library(caret)
## Warning: package 'caret' was built under R version 4.1.3
## Loading required package: lattice
library(lubridate)
## Warning: package 'lubridate' was built under R version 4.1.3
##
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
library(finalfit)
## Warning: package 'finalfit' was built under R version 4.1.3
library(scales)
library(ggpol)
## Warning: package 'ggpol' was built under R version 4.1.3
Having looked at the macro properties of the data (number of features, observations, and data types), we can see that the 'character' type will hinder the analysis: several columns hold numeric values that were mistakenly read as character. We therefore first assign each feature the class it actually belongs to.
policing$OFFICER_ID <- as.integer(as.character(policing$OFFICER_ID))
policing$OFFICER_YEARS_ON_FORCE<- as.integer(as.character(policing$OFFICER_YEARS_ON_FORCE))
policing[, 12] <- as.numeric(policing[, 12])
policing[, 20] <- as.numeric(policing[, 20])
policing[, 21] <- as.numeric(policing[, 21])
policing[, 22] <- as.numeric(policing[, 22])
policing[, 32] <- as.numeric(policing[, 32])
policing[, 33] <- as.numeric(policing[, 33])
policing[, 5] <- as.factor(policing[, 5])
policing[, 6] <- as.factor(policing[, 6])
policing[, 9] <- as.factor(policing[, 9])
policing[, 11] <- as.factor(policing[, 11])
policing[, 13] <- as.factor(policing[, 13])
policing[, 14] <- as.factor(policing[, 14])
policing[, 15] <- as.factor(policing[, 15])
policing[, 17] <- as.factor(policing[, 17])
policing[, 18] <- as.factor(policing[, 18])
policing[, 23] <- as.factor(policing[, 23])
policing[, 24] <- as.factor(policing[, 24])
policing[, 34] <- as.factor(policing[, 34])
policing[, 35] <- as.factor(policing[, 35])
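The column-by-column conversions above can be written more compactly in base R by converting groups of columns in one pass; a sketch on a toy frame (the indices and names are hypothetical):

```r
# Convert groups of columns by index in one pass instead of line by line.
toy <- data.frame(years = c("3", "10"), gender = c("Male", "Female"))
num_idx <- 1  # columns that should be numeric
fac_idx <- 2  # columns that should be factors
toy[num_idx] <- lapply(toy[num_idx], function(x) as.numeric(as.character(x)))
toy[fac_idx] <- lapply(toy[fac_idx], as.factor)
str(toy)
```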
The data set also contains date features that need proper class categorisation. We therefore convert them to the Date class and add them as additional columns, keeping the original data as it is. After adding the extra columns we have pairs of features carrying essentially the same information, bringing the total to 49 features, but keeping both adds negligible time and computational cost.
new_date <- as.Date(policing$INCIDENT_DATE, "%m/%d/%Y")
policing['new_date'] <- new_date
hiringbyy <- as.Date(policing$OFFICER_HIRE_DATE, "%m/%d/%Y")
policing['yearwisehiring'] <- year(hiringbyy)
dim(policing)
## [1] 2383 49
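A quick sanity check of the "%m/%d/%Y" format string used above, on a literal date value:

```r
# "%m/%d/%Y" parses month/day/four-digit-year strings such as "9/3/2016".
d <- as.Date("9/3/2016", format = "%m/%d/%Y")
format(d, "%Y")  # "2016"
```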
A new subset containing only the numeric columns is created; this simplifies working with the numerical features and visualizing the results. We also look at the distribution of the numeric variables using histograms.
policing.numeric <- policing[,sapply(policing, is.numeric)]
dim(policing.numeric)
## [1] 2383 9
par(mfrow=c(3, 3))
colnames <- dimnames(policing.numeric)[[2]]
# plot a histogram for every numeric column (all nine fit the 3 x 3 grid)
for (i in seq_along(colnames)) {
hist(policing.numeric[,i], main=colnames[i], probability=TRUE, col="gray", border="white")
}
Now we look at the percentage of missing values in each feature; this lets us decide which features are redundant and can be removed, reducing both the complexity of the analysis and the computational cost.
sum(is.na.data.frame(policing))
## [1] 110
missing_glimpse(policing, dependent = NULL, explanatory = NULL, digits = 1)
## label
## INCIDENT_DATE INCIDENT_DATE
## INCIDENT_TIME INCIDENT_TIME
## UOF_NUMBER UOF_NUMBER
## OFFICER_ID OFFICER_ID
## OFFICER_GENDER OFFICER_GENDER
## OFFICER_RACE OFFICER_RACE
## OFFICER_HIRE_DATE OFFICER_HIRE_DATE
## OFFICER_YEARS_ON_FORCE OFFICER_YEARS_ON_FORCE
## OFFICER_INJURY OFFICER_INJURY
## OFFICER_INJURY_TYPE OFFICER_INJURY_TYPE
## OFFICER_HOSPITALIZATION OFFICER_HOSPITALIZATION
## SUBJECT_ID SUBJECT_ID
## SUBJECT_RACE SUBJECT_RACE
## SUBJECT_GENDER SUBJECT_GENDER
## SUBJECT_INJURY SUBJECT_INJURY
## SUBJECT_INJURY_TYPE SUBJECT_INJURY_TYPE
## SUBJECT_WAS_ARRESTED SUBJECT_WAS_ARRESTED
## SUBJECT_DESCRIPTION SUBJECT_DESCRIPTION
## SUBJECT_OFFENSE SUBJECT_OFFENSE
## REPORTING_AREA REPORTING_AREA
## BEAT BEAT
## SECTOR SECTOR
## DIVISION DIVISION
## LOCATION_DISTRICT LOCATION_DISTRICT
## STREET_NUMBER STREET_NUMBER
## STREET_NAME STREET_NAME
## STREET_DIRECTION STREET_DIRECTION
## STREET_TYPE STREET_TYPE
## LOCATION_FULL_STREET_ADDRESS_OR_INTERSECTION LOCATION_FULL_STREET_ADDRESS_OR_INTERSECTION
## LOCATION_CITY LOCATION_CITY
## LOCATION_STATE LOCATION_STATE
## LOCATION_LATITUDE LOCATION_LATITUDE
## LOCATION_LONGITUDE LOCATION_LONGITUDE
## INCIDENT_REASON INCIDENT_REASON
## REASON_FOR_FORCE REASON_FOR_FORCE
## TYPE_OF_FORCE_USED1 TYPE_OF_FORCE_USED1
## TYPE_OF_FORCE_USED2 TYPE_OF_FORCE_USED2
## TYPE_OF_FORCE_USED3 TYPE_OF_FORCE_USED3
## TYPE_OF_FORCE_USED4 TYPE_OF_FORCE_USED4
## TYPE_OF_FORCE_USED5 TYPE_OF_FORCE_USED5
## TYPE_OF_FORCE_USED6 TYPE_OF_FORCE_USED6
## TYPE_OF_FORCE_USED7 TYPE_OF_FORCE_USED7
## TYPE_OF_FORCE_USED8 TYPE_OF_FORCE_USED8
## TYPE_OF_FORCE_USED9 TYPE_OF_FORCE_USED9
## TYPE_OF_FORCE_USED10 TYPE_OF_FORCE_USED10
## NUMBER_EC_CYCLES NUMBER_EC_CYCLES
## FORCE_EFFECTIVE FORCE_EFFECTIVE
## new_date new_date
## yearwisehiring yearwisehiring
## var_type n missing_n
## INCIDENT_DATE <chr> 2383 0
## INCIDENT_TIME <chr> 2383 0
## UOF_NUMBER <chr> 2383 0
## OFFICER_ID <int> 2383 0
## OFFICER_GENDER <fct> 2383 0
## OFFICER_RACE <fct> 2383 0
## OFFICER_HIRE_DATE <chr> 2383 0
## OFFICER_YEARS_ON_FORCE <int> 2383 0
## OFFICER_INJURY <fct> 2383 0
## OFFICER_INJURY_TYPE <chr> 2383 0
## OFFICER_HOSPITALIZATION <fct> 2383 0
## SUBJECT_ID <dbl> 2383 0
## SUBJECT_RACE <fct> 2383 0
## SUBJECT_GENDER <fct> 2383 0
## SUBJECT_INJURY <fct> 2383 0
## SUBJECT_INJURY_TYPE <chr> 2383 0
## SUBJECT_WAS_ARRESTED <fct> 2383 0
## SUBJECT_DESCRIPTION <fct> 2383 0
## SUBJECT_OFFENSE <chr> 2383 0
## REPORTING_AREA <dbl> 2383 0
## BEAT <dbl> 2383 0
## SECTOR <dbl> 2383 0
## DIVISION <fct> 2383 0
## LOCATION_DISTRICT <fct> 2383 0
## STREET_NUMBER <chr> 2383 0
## STREET_NAME <chr> 2383 0
## STREET_DIRECTION <chr> 2383 0
## STREET_TYPE <chr> 2383 0
## LOCATION_FULL_STREET_ADDRESS_OR_INTERSECTION <chr> 2383 0
## LOCATION_CITY <chr> 2383 0
## LOCATION_STATE <chr> 2383 0
## LOCATION_LATITUDE <dbl> 2328 55
## LOCATION_LONGITUDE <dbl> 2328 55
## INCIDENT_REASON <fct> 2383 0
## REASON_FOR_FORCE <fct> 2383 0
## TYPE_OF_FORCE_USED1 <chr> 2383 0
## TYPE_OF_FORCE_USED2 <chr> 2383 0
## TYPE_OF_FORCE_USED3 <chr> 2383 0
## TYPE_OF_FORCE_USED4 <chr> 2383 0
## TYPE_OF_FORCE_USED5 <chr> 2383 0
## TYPE_OF_FORCE_USED6 <chr> 2383 0
## TYPE_OF_FORCE_USED7 <chr> 2383 0
## TYPE_OF_FORCE_USED8 <chr> 2383 0
## TYPE_OF_FORCE_USED9 <chr> 2383 0
## TYPE_OF_FORCE_USED10 <chr> 2383 0
## NUMBER_EC_CYCLES <chr> 2383 0
## FORCE_EFFECTIVE <chr> 2383 0
## new_date <date> 2383 0
## yearwisehiring <dbl> 2383 0
## missing_percent
## INCIDENT_DATE 0.0
## INCIDENT_TIME 0.0
## UOF_NUMBER 0.0
## OFFICER_ID 0.0
## OFFICER_GENDER 0.0
## OFFICER_RACE 0.0
## OFFICER_HIRE_DATE 0.0
## OFFICER_YEARS_ON_FORCE 0.0
## OFFICER_INJURY 0.0
## OFFICER_INJURY_TYPE 0.0
## OFFICER_HOSPITALIZATION 0.0
## SUBJECT_ID 0.0
## SUBJECT_RACE 0.0
## SUBJECT_GENDER 0.0
## SUBJECT_INJURY 0.0
## SUBJECT_INJURY_TYPE 0.0
## SUBJECT_WAS_ARRESTED 0.0
## SUBJECT_DESCRIPTION 0.0
## SUBJECT_OFFENSE 0.0
## REPORTING_AREA 0.0
## BEAT 0.0
## SECTOR 0.0
## DIVISION 0.0
## LOCATION_DISTRICT 0.0
## STREET_NUMBER 0.0
## STREET_NAME 0.0
## STREET_DIRECTION 0.0
## STREET_TYPE 0.0
## LOCATION_FULL_STREET_ADDRESS_OR_INTERSECTION 0.0
## LOCATION_CITY 0.0
## LOCATION_STATE 0.0
## LOCATION_LATITUDE 2.3
## LOCATION_LONGITUDE 2.3
## INCIDENT_REASON 0.0
## REASON_FOR_FORCE 0.0
## TYPE_OF_FORCE_USED1 0.0
## TYPE_OF_FORCE_USED2 0.0
## TYPE_OF_FORCE_USED3 0.0
## TYPE_OF_FORCE_USED4 0.0
## TYPE_OF_FORCE_USED5 0.0
## TYPE_OF_FORCE_USED6 0.0
## TYPE_OF_FORCE_USED7 0.0
## TYPE_OF_FORCE_USED8 0.0
## TYPE_OF_FORCE_USED9 0.0
## TYPE_OF_FORCE_USED10 0.0
## NUMBER_EC_CYCLES 0.0
## FORCE_EFFECTIVE 0.0
## new_date 0.0
## yearwisehiring 0.0
We have 110 null values in total among the features that matter for the analysis: 55 missing in latitude and 55 in longitude, i.e. about 2.3% of each column. Missing values at this level are not significant, and the affected rows can be ignored. The TYPE_OF_FORCE_USED4 through TYPE_OF_FORCE_USED10 columns and NUMBER_EC_CYCLES, by contrast, consist mostly of empty values (likely recorded as empty strings rather than NA, which is why the table above reports zero missing for them) and are redundant for the analysis; deleting or ignoring such features is an appropriate way to deal with them.
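The decision rule for dropping sparse columns can be sketched as follows on a toy frame (the 50% threshold and the column names are assumptions for illustration):

```r
# Compute per-column missing percentages and keep columns below a threshold.
toy <- data.frame(lat = c(32.7, NA, 32.8), force4 = c(NA, NA, NA))
miss_pct <- colMeans(is.na(toy)) * 100
miss_pct                   # lat ~33.3, force4 100
toy <- toy[miss_pct < 50]  # 50% is an arbitrary cut-off for illustration
names(toy)                 # "lat"
```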
Finally, we remove any duplicate rows, matching on incident date, incident time, UOF number, and officer ID. A few entries were duplicates and are removed before further processing, leaving 2328 observations and 49 features (including the two features added for the analysis).
policing <- policing[!duplicated(policing[c(1,2,3,4)]),]
dim(policing)
## [1] 2328 49
There are different types of visualization, including bar plots, histograms, and scatter plots. The data set covers the officers (their ethnicity, gender, years served, and arrests made) and the subjects (their offences, gender, ethnicity, etc.). Before visualizing, let us build a subset of the main data set that has unique records for officers and subjects.
new_date <- as.Date(policing$INCIDENT_DATE, "%m/%d/%Y")
policing['new_date'] <- new_date
hiringbyy <- as.Date(policing$OFFICER_HIRE_DATE, "%m/%d/%Y")
policing['yearwisehiring'] <- year(hiringbyy)
xgen <- distinct(policing, OFFICER_ID, OFFICER_RACE, .keep_all= TRUE)
require(ggcorrplot)
## Loading required package: ggcorrplot
## Warning: package 'ggcorrplot' was built under R version 4.1.3
corr.dataset = policing.numeric %>% select(SECTOR, BEAT, REPORTING_AREA, SUBJECT_ID, OFFICER_ID, OFFICER_YEARS_ON_FORCE)
corr = cor(corr.dataset)
corr.p = cor_pmat(corr.dataset)
ggcorrplot(corr, hc.order = TRUE, type = "lower", lab = TRUE, lab_size = 3,
outline.col = "black", ggtheme = ggplot2::theme_bw(),
colors = c("yellow", "green", "blue"))
The correlation plot shows that OFFICER_ID and OFFICER_YEARS_ON_FORCE have a negative correlation with a coefficient of -0.92, whereas BEAT and SECTOR are perfectly positively correlated with a value of 1.
table(policing$OFFICER_GENDER)
##
## Female Male
## 236 2092
distinctvals <- policing %>%
group_by(OFFICER_GENDER) %>%
summarise(n=n_distinct(OFFICER_ID))
distinctvals
## # A tibble: 2 x 2
## OFFICER_GENDER n
## <fct> <int>
## 1 Female 126
## 2 Male 915
gender = ggplot(data = policing, aes(x=OFFICER_GENDER))
gender1 = gender + geom_bar(fill="#fb8072", alpha=0.4) +
geom_text(aes(label = ..count..), stat = "count", vjust = 1.5, colour = "Blue") +
xlab("GENDER") + ylab("Number of Officers")
gender2 = gender1 + theme_bw() + theme(axis.title = element_text(size = 12, color = "blue"), axis.text = element_text(size = 10, face = "bold"))
gender2
We have 2092 records for male officers and 236 for female officers, but these counts include duplicates. The true officer counts are plotted below:
distgenderplot <- policing %>%
group_by(OFFICER_GENDER) %>%
summarise(n=n_distinct(OFFICER_ID)) %>%
ggplot(.,aes(OFFICER_GENDER, n)) + geom_bar(stat = 'identity', fill="#fb8072", alpha=0.5) + ylim(0,1000) +
xlab("Gender Segregation") + ylab("Number of Officers")
distgenderplot1 = distgenderplot + theme_bw() + theme(axis.title = element_text(size = 12, color = "blue"), axis.text = element_text(size = 10, face = "bold"))
distgenderplot1
In fact, we have 126 distinct female officers and 915 distinct male officers.
Let us look at the number of years officers have served.
(hist(x=xgen$OFFICER_YEARS_ON_FORCE, breaks = 4, ylim = c(0,1000), col = 2:6, main = "Yearly Grouped Segregation of number of years served", ylab = "Number of officers", xlab = "No of Years", border = "White", labels = TRUE))
## $breaks
## [1] 0 10 20 30 40
##
## $counts
## [1] 729 167 128 19
##
## $density
## [1] 0.069894535 0.016011505 0.012272291 0.001821668
##
## $mids
## [1] 5 15 25 35
##
## $xname
## [1] "xgen$OFFICER_YEARS_ON_FORCE"
##
## $equidist
## [1] TRUE
##
## attr(,"class")
## [1] "histogram"
text(5, 325, "Majority", srt = 90)
We can see that 729 officers have 0 to 10 years of experience, making them the majority of the force, whereas the department has comparatively few highly experienced officers: only 147 have served over 20 years. Officers with 30 to 40 years of service are presumably near the end of their tenure and will soon retire; such officers tend to hold office posts and may not take part in field activities.
Let us look at the same distribution in an interactive plot, where we can read off the exact number of officers for each year of service.
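The 10-year experience bands read off the histogram above can be reproduced directly with cut() (shown here on a toy vector):

```r
# Band years of service into the same 10-year intervals as the histogram.
years <- c(2, 7, 15, 33)
bands <- cut(years, breaks = c(0, 10, 20, 30, 40),
             labels = c("0-10", "10-20", "20-30", "30-40"))
table(bands)  # 0-10: 2, 10-20: 1, 20-30: 0, 30-40: 1
```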
yearsserved = ggplot(data=xgen, aes(x=OFFICER_YEARS_ON_FORCE))
noofyears = yearsserved + geom_histogram(binwidth = 1, fill="#2b8cbe", alpha=0.6) +
xlab("Years Served") + ylab("Number of officers") + xlim(0,40) + ylim(0,120)
ggplotly(noofyears)
Let us look at the different officer ethnicities and the gender split within each ethnic group.
racegender <- table(xgen$OFFICER_GENDER, xgen$OFFICER_RACE)
racegender
##
## American Ind Asian Black Hispanic Other White
## Female 1 5 29 26 3 62
## Male 6 22 149 204 6 530
library(RColorBrewer)
## Warning: package 'RColorBrewer' was built under R version 4.1.3
xgen <- distinct(policing, OFFICER_ID, OFFICER_RACE, .keep_all= TRUE)
racegender1 <- table(xgen$OFFICER_GENDER, xgen$OFFICER_RACE)
xgender <- barplot(as.matrix(t(racegender1)), beside = TRUE, main = "Ethnic segregation of genders", ylab = "Number of Officers", xlab = "Gender",
legend.text = TRUE,
args.legend=list(bty="7", x = "topleft" , ncol = 3), ylim = c(0,800),
col=brewer.pal(7,"Set1"), border="Black")
ygender<-as.matrix(t(racegender1))
text(xgender, ygender+50, labels = as.character(ygender))
Now we look at how the Dallas police partition the city. They actually divide Dallas in two ways: into divisions and into location districts. First, let us look at the number of distinct officers in each division.
distgen <- xgen %>%
group_by(DIVISION) %>% summarise(count=n_distinct(OFFICER_ID))
distgen
## # A tibble: 7 x 2
## DIVISION count
## <fct> <int>
## 1 CENTRAL 197
## 2 NORTH CENTRAL 132
## 3 NORTHEAST 146
## 4 NORTHWEST 114
## 5 SOUTH CENTRAL 126
## 6 SOUTHEAST 162
## 7 SOUTHWEST 165
distgen <- xgen %>%
group_by(DIVISION) %>% summarise(count=n_distinct(OFFICER_ID))
distgenplot <- plot(distgen$DIVISION, distgen$count, main = "Division wise count of officers", col = 2, xlab = "DIVISONS", ylab = "Number of officers")
distgenplot
## $stats
## [,1] [,2] [,3] [,4] [,5] [,6] [,7]
## [1,] 197 132 146 114 126 162 165
## [2,] 197 132 146 114 126 162 165
## [3,] 197 132 146 114 126 162 165
## [4,] 197 132 146 114 126 162 165
## [5,] 197 132 146 114 126 162 165
##
## $n
## [1] 1 1 1 1 1 1 1
##
## $conf
## [,1] [,2] [,3] [,4] [,5] [,6] [,7]
## [1,] 197 132 146 114 126 162 165
## [2,] 197 132 146 114 126 162 165
##
## $out
## numeric(0)
##
## $group
## numeric(0)
##
## $names
## [1] "CENTRAL" "NORTH CENTRAL" "NORTHEAST" "NORTHWEST"
## [5] "SOUTH CENTRAL" "SOUTHEAST" "SOUTHWEST"
Let us look at the number of districts within each division.
length(table(policing$LOCATION_DISTRICT))
## [1] 14
locindiv <- policing %>% group_by(DIVISION) %>% summarise(noofdistricts = n_distinct(LOCATION_DISTRICT)) %>% arrange(desc(noofdistricts))
locindiv
## # A tibble: 7 x 2
## DIVISION noofdistricts
## <fct> <int>
## 1 NORTHEAST 6
## 2 CENTRAL 5
## 3 SOUTHWEST 5
## 4 NORTHWEST 4
## 5 SOUTH CENTRAL 4
## 6 SOUTHEAST 4
## 7 NORTH CENTRAL 3
As the tables above show, Dallas has 7 divisions and 14 districts, and the NORTHEAST division covers the largest number of districts, both for policing the city and for handling criminal offences.
DIVISIONS <- locindiv$DIVISION
clrgenre <- c('#8dd3c7','#ffffb3','#bebada','#fb8072','#80b1d3','#fdb462','#b3de69', '#000000')
ggplot(locindiv, aes(x=noofdistricts, fill=DIVISIONS)) + geom_histogram(binwidth=1) + scale_fill_manual(values = clrgenre) +
labs(x = "Number of Districts",
y = "Number of Divisions Covering Districts",
title = "Graph Visualizing number of districts covered by divisions")
The graph above shows that NORTH CENTRAL covers 3 districts, whereas SOUTHEAST, SOUTH CENTRAL, and NORTHWEST cover 4 districts each. One division, NORTHEAST, covers 6 districts.
Let us look at officers getting injured and hospitalized relative to their years of service.
ggplot(policing,
aes(x = OFFICER_YEARS_ON_FORCE,
fill = OFFICER_INJURY)) +
geom_density(alpha = 0.4) +
labs(title = "Years of experience distribution by officer injury")
The graph above shows that officers with less experience are injured the most, while more experienced officers sustain fewer injuries.
library(ggplot2)
ggplot(policing,
aes(x = factor(OFFICER_HOSPITALIZATION,
labels = c("Not Hospitalized",
"Hospitalized")),
y = OFFICER_YEARS_ON_FORCE,
fill=OFFICER_HOSPITALIZATION)) +
geom_boxjitter(color="black",
jitter.color = "darkgrey",
errorbar.draw = TRUE) +
labs(title = "Officer Hospitalization by Years of Experience",
x = "Hospitalization Status",
y = "Years of Experience") +
theme_minimal() +
theme(legend.position = "none")
Hospitalization shows the same pattern: the graph above indicates that officers with fewer than 10 years of experience are hospitalized more often than officers with more than 10 years of experience. We can take another look at officers who were injured and hospitalized, or injured without being hospitalized.
datenew <- policing
datenew1 <- datenew %>% mutate(hireyear = mdy(OFFICER_HIRE_DATE)) %>%
mutate(hyear = year(hireyear))
library(forcats)
levels(datenew1$OFFICER_HOSPITALIZATION) <- list(NotHospitalized = "No", Hospitalized = "Yes")
levels(datenew1$OFFICER_INJURY) <- list(NotInjured = "No", Injured = "Yes")
xgennew <- datenew1  # copy after relabelling so the plot uses the new factor levels
G = ggplot(xgennew, aes(x=hyear, y =OFFICER_YEARS_ON_FORCE)) + geom_boxplot(aes(group = hyear))
(G = G + facet_grid(OFFICER_HOSPITALIZATION ~ OFFICER_INJURY) + labs(title = "Officer Injury & Hospitalization",
x = "Hiring Year",
y = "Years of Experience"))
The graph above shows the officer population that was hospitalized, split by whether or not they were injured.
Now let us look at the officers who made arrests and those who did not, with respect to their years of experience and the year they were hired.
datenew1 <- datenew1 %>% mutate(hiringbyyear= case_when(
hyear >= 2000 & hyear <= 2010 ~ "2000 to 2010",
hyear > 1978 & hyear < 2000 ~ "1979 to 1999",
TRUE ~ "after 2010")
)
G = ggplot(datenew1, aes(hyear, OFFICER_YEARS_ON_FORCE, color = hiringbyyear)) + geom_point()
G = G + scale_color_manual(values = c("blue", "yellow", "red"))
New_names <- c(`Yes` = "Made Arrest", `No` = "Did Not Make Arrest")
arrestsmade <- G + facet_wrap(~ SUBJECT_WAS_ARRESTED, labeller = as_labeller(New_names)) + labs(title = "Officer Arrest Status",
subtitle = "Categorized Hiring Date & their arrest record",
x = "Hiring Year",
y = "Years of Experience")
arrestsmade
distcrimes <- policing %>%
summarise(n=n_distinct(SUBJECT_ID))
distcrimes
## n
## 1 1433
So there are 1433 distinct subjects reported for crimes in 2016.
Let us break the crimes reported in 2016 down by month.
new_date_crime <- as.Date(policing$INCIDENT_DATE, "%m/%d/%Y")
policing['crime_date'] <- new_date_crime
policing['crime_month'] <- month(new_date_crime)
(hist(x=policing$crime_month, breaks = 8, ylim = c(0,1000), col = 2:6, main = "Monthly count of crimes reported", ylab = "Number of crimes", xlab = "Months", border = "White", labels = TRUE))
## $breaks
## [1] 1 2 3 4 5 6 7 8 9 10 11 12
##
## $counts
## [1] 474 259 223 215 184 179 182 201 164 147 100
##
## $density
## [1] 0.20360825 0.11125430 0.09579038 0.09235395 0.07903780 0.07689003
## [7] 0.07817869 0.08634021 0.07044674 0.06314433 0.04295533
##
## $mids
## [1] 1.5 2.5 3.5 4.5 5.5 6.5 7.5 8.5 9.5 10.5 11.5
##
## $xname
## [1] "policing$crime_month"
##
## $equidist
## [1] TRUE
##
## attr(,"class")
## [1] "histogram"
text(1.5, 200, "High Count", srt = 90)
The graph above clearly shows that crimes reported in January are by far the most numerous. The next graph visualizes the same thing in a cleaner way.
datenew <- policing
library(lubridate)
library(DT)
## Warning: package 'DT' was built under R version 4.1.3
datenew <- policing %>%
mutate(new_inc_date = mdy(INCIDENT_DATE)) %>%
mutate(CRIMEMONTH = month(new_inc_date, label =TRUE)) %>%
mutate(CRIMEDAY = day(new_inc_date)) %>%
mutate(CRIMEWEEK = wday(new_inc_date, label = TRUE))
datenew %>%
count(CRIMEMONTH) %>%
mutate(`Number of Crime` = n) %>%
select(CRIMEMONTH, `Number of Crime`)%>%
DT::datatable(colnames = c("Month", "Number of Crimes"),
options = list(pageLength = 12))
library(plotly)
monthly <-datenew %>%
count(CRIMEMONTH) %>%
mutate(`Number of Crime` = n) %>%
ggplot() +
aes(x = CRIMEMONTH, weight = `Number of Crime`) +
geom_bar(fill = "green", width = 0.3) +
labs(
x = "Month\n",
y = "\nNumber of Crime\n",
title = "Number of crime by Month",
subtitle = "Year:2016"
) +
theme_light() ->m
ggplotly(
m,
tooltip = c("CRIMEMONTH", "Number of Crime")
)
The graph above is an interactive plot showing the number of crimes reported in each month.
subgen <- policing %>%
group_by(SUBJECT_GENDER) %>%
summarise(n=n_distinct(SUBJECT_ID))
subgen$fraction <- subgen$n / sum(subgen$n)
subgen$ymax <- cumsum(subgen$fraction)
subgen$ymin <- c(0, head(subgen$ymax, n=-1))
subgen$labelPosition <- (subgen$ymax + subgen$ymin) / 2
subgen$label <- paste0(subgen$SUBJECT_GENDER, "\n value: ", subgen$n)
newsubgen <- subgen[-c(3, 4), ]
(ggplot(newsubgen, aes(ymax=ymax, ymin=ymin, xmax=4, xmin=3, fill=SUBJECT_GENDER)) +
geom_rect() +
geom_text( x=5.2, aes(y=labelPosition, label=label, color=SUBJECT_GENDER), size=6) + # x here controls label position (inner / outer)
scale_fill_brewer(palette=5) +
scale_color_brewer(palette=5) +
coord_polar(theta="y") +
xlim(c(-1, 4)) +
theme_void() +
theme(legend.position = "none") +
ggtitle("No of Female and Male reported for crime"))
So, we have 1170 males and 260 females reported for crimes.
Let us look at the crimes segregated by ethnicity.
require(rAmCharts)
## Loading required package: rAmCharts
## Warning: package 'rAmCharts' was built under R version 4.1.3
## Full amcharts.js API available using amChartsAPI()
## Look at rAmCharts::runExamples() & http://datastorm-open.github.io/introduction_ramcharts/
## Bug report or feed back on https://github.com/datastorm-open/rAmCharts
##
## Attaching package: 'rAmCharts'
## The following object is masked from 'package:plotly':
##
## api
library(dplyr)
piedata <- policing %>%
group_by(SUBJECT_RACE) %>%
count() %>%
ungroup() %>%
mutate(percentage = `n` / sum(`n`)) %>%
arrange(percentage) %>%
mutate(labels = scales::percent(percentage))
piedata %>% dplyr::select(label=SUBJECT_RACE, value = n) %>%
amPie(., inner_radius = 25,
depth = 6,
show_values = TRUE,
legend = TRUE,
export = TRUE,
main = "Percent of criminal by Race")
Let us look at subject arrests. First, we look at the total number of subjects arrested; then at the total injured; and finally at the subjects who were both arrested and injured. A numerical analysis is used here.
table(policing$SUBJECT_WAS_ARRESTED)
##
## No Yes
## 317 2011
table(policing$SUBJECT_INJURY)
##
## No Yes
## 1704 624
The tables above show that, out of the total policing records for 2016, 2011 subjects were arrested and 317 were not. Likewise, 624 subjects were injured and 1704 were not. Now let us look at the number of injured subjects among those arrested.
arrtoinj <- policing %>% filter(SUBJECT_WAS_ARRESTED == 'Yes') %>% group_by(SUBJECT_INJURY) %>% summarise(length(SUBJECT_INJURY))
arrtoinj
## # A tibble: 2 x 2
## SUBJECT_INJURY `length(SUBJECT_INJURY)`
## <fct> <int>
## 1 No 1435
## 2 Yes 576
So, of the 2011 arrested subjects, 576 were injured and the remaining 1435 were not.
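The injured-among-arrested share follows directly from the counts above:

```r
# Share of arrested subjects who were injured, from the table above.
arrested <- 2011
injured_among_arrested <- 576
round(100 * injured_among_arrested / arrested, 1)  # 28.6
```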
library(dplyr)
library(ggplot2)
datenew1 %>% count(SUBJECT_DESCRIPTION) %>%
ggplot(aes(x= reorder(SUBJECT_DESCRIPTION, n), y = n)) +
geom_col(fill = "#756bb1") +
labs(x = "Offense Type",
y = "Number of Crimes",
title = paste0("Offense types reported in 2016")) +
coord_flip() +
theme_minimal()
The graph above displays the subject descriptions against the number of incidents. We can see that most incidents involve subjects described as mentally unstable or under the influence of alcohol.
library(dplyr)
library(ggplot2)
datenew1 %>% count(REASON_FOR_FORCE) %>%
ggplot(aes(x= reorder(REASON_FOR_FORCE, n), y = n)) +
geom_col(fill = "#f46d43") +
labs(x = "Reason for force",
y = "Number of times force used",
title = paste0("Reason of force used")) + coord_flip() +
theme_minimal()
And the graph above shows how often force was used and the reasons for which it was used.
Let us plot a map of the crime locations in Dallas.
library(leaflet)
## Warning: package 'leaflet' was built under R version 4.1.3
policing %>%
leaflet(.) %>%
addTiles() %>%
addCircleMarkers(lng = ~LOCATION_LONGITUDE,
lat = ~LOCATION_LATITUDE,
radius=0.01,
fillOpacity = 0.001)
## Warning in validateCoords(lng, lat, funcName): Data contains 55 rows with either
## missing or invalid lat/lon values and will be ignored
Let us look at the locations where most of the crimes were committed.
library(ggmap)
## Warning: package 'ggmap' was built under R version 4.1.3
## Google's Terms of Service: https://cloud.google.com/maps-platform/terms/.
## Please cite ggmap if you use it! See citation("ggmap") for details.
##
## Attaching package: 'ggmap'
## The following object is masked from 'package:plotly':
##
## wind
sbbox <- make_bbox(lon = c(policing$LOCATION_LONGITUDE), lat = c(policing$LOCATION_LATITUDE), f = .1)
dallas = get_map(location=sbbox, zoom=10, maptype="terrain")
## Source : http://tile.stamen.com/terrain/10/236/412.png
## Source : http://tile.stamen.com/terrain/10/237/412.png
## Source : http://tile.stamen.com/terrain/10/236/413.png
## Source : http://tile.stamen.com/terrain/10/237/413.png
dallas1 = ggmap(dallas)
dallas1 +
geom_point(data = policing, mapping = aes(x =LOCATION_LONGITUDE, y =LOCATION_LATITUDE),
color = "red") +
geom_text(data = policing,
mapping = aes(x = LOCATION_LONGITUDE+0.1,
y = LOCATION_LATITUDE,
label = "Crime Spots"),
size = 2, color = "gray20",
fontface = "bold",
check_overlap = T)
## Warning: Removed 55 rows containing missing values (geom_point).
## Warning: Removed 71 rows containing missing values (geom_text).
datenew1[, 12] <- as.numeric(datenew1[, 12])
datenew1[, 20] <- as.numeric(datenew1[, 20])
datenew1[, 21] <- as.numeric(datenew1[, 21])
datenew1[, 22] <- as.numeric(datenew1[, 22])
datenew1[, 32] <- as.numeric(datenew1[, 32])
datenew1[, 33] <- as.numeric(datenew1[, 33])
names(datenew1)[names(datenew1) == "LOCATION_LONGITUDE"] <- "long"
names(datenew1)[names(datenew1) == "LOCATION_LATITUDE"] <- "lat"